Crucial to all good statistical analyses is a thorough exploratory analysis. In this document, we will introduce a few techniques for developing an understanding of our dataset, including an understanding of its limitations.
Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.
library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)
If you have not done so already, you need to download the two sets of files associated with the workshop.
All the workshop script and tutorial files can be found on Github at github.com/joenomiddlename/DFO_SDM_Workshop_2020. To download these files on your computer, press on Code and then Download ZIP. Once downloaded, unzip the DFO_SDM_Workshop_2020-main.zip file. To be able to access some of the precompiled data, we need to make the unzipped folder the working directory. To do so in R Studio, navigate to the correct folder using the bottom right panel (folder view, you can use … button) and open it. Then, click “Set as Working Directory” under the tab ‘More’.
We should be inside the folder titled: ‘DFO_SDM_Workshop_2020’, you can verify this by using getwd().
Now we can load in the precompiled data
list2env(readRDS('./Data/Compiled_Data_new.rds'), globalenv())
## <environment: R_GlobalEnv>
Throughout this workshop, we use data collected on fin whales in June, July, and August across the years 2007-2009 and for 2011. There are two main types of data:
Encounters from systematic aerial surveys, found in Sightings_survey. The encounters information contain counts of fin whales, although for this tutorial we simplify the analysis and only model presence/absence. These systematic surveys also contain information on the distance of the encounters from the plane. Furthermore, they are accompanied with the aerial tracklines, found in Effort_survey. In these data objects, we combined three surveys conducted by both DFO and NOAA: 2007 T-NASS Aerial Survey (Lawson et al. 2009), NOAA NARW Surveys (Cole, NOAA), and NOAA Cetacean Surveys (Cole et al., NOAA).
Opportunistic fin whale encounters from two whale-watching operators, found in Sightings_DRWW_sp. This dataset is not accompanied by tracklines. Again, these datasets contains whale counts, but for simplicity we will analyze them as presence/absence data in these tutorials. These encounters are maintained in the DFO Maritimes Region Whale Sightings Database (MacDonald et. al. 2017).
Distance from port for both the Quoddy and Brier whale watch vessels, found in Dist_Brier and Dist_Quoddy. These covariates will be used to model effort from the whale watch vessels.
Bathymetric slope data, found in Slope. Slope data will be our main covariate in our analysis and we will explore whether the distribution of fin whales is linked to this.
These data are spatially-referenced data. The first thing we must always do is check for consistency between the coordinate reference systems (CRS) of each spatial object in use! To print the CRS of an sp object, simply add @proj4string to the name of the spatial object and run.
Sightings_DRWW_sp@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Sightings_survey@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Effort_survey@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Domain@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Slope@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
WW_ports@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Brier@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
Dist_Quoddy@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
All of the spatial objects are in lat/lon - good! For future analysis we will be projecting the data into a different coordinate reference system to better preserve euclidean distance.
Finally, let’s turn off all warnings associated with coordinate reference systems. The latest PROJ6+/GDAL3+ updates have caused many warning messages to be printed.
rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")
Our first goal is generally to plot the data on a map. The gg() and gmap() functions from the inlabru package are extremely useful at plotting spatial data! The class of spatial objects we will use are from the sp package. The classes of these objects begin with ‘Spatial’. For example SpatialPointsDataFrame.
We have written a bespoke function gg.spatiallines_mod() to easily add SpatialLinesDataFrame objects to the plots too. This will prove useful for plotting transect lines. We load the bespoke functions to the working environment now.
source('utility_functions.R')
Let’s plot our data!
If the data are in lat/lon format then the gmap() function will automatically add a terrain layer to the plots. Let’s plot the survey encounters in blue, the survey tracklines in black, the whale-watch encounters in purple, the whale watch ports in red.
gmap(Sightings_survey) +
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
Some of the maps used by default in gmap are copyrighted (e.g., Google maps, see ?get_map), see Additional tips for ways to use open-source maps for publication.
To remove the map layer, simply replace the gmap(Sightings_survey) with ggplot().
ggplot() + # Notice the empty ggplot() call
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
This plot hides some crucial information regarding the data collection. For example, the survey encounters and tracklines do not come from a single survey, or even a single organisation! Let’s plot this! The easiest way to do this is to subset the data accordingly!
table(Effort_survey$DATASET)
##
## DFO NOAA_1 NOAA_2
## 49 54 70
# there are 3 surveys
ggplot() +
gg(Domain) +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='DFO',], colour='purple') +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_1',], colour='red') +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_2',], colour='yellow')
This is problematic! The DFO tracklines (in purple) do not overlap with the two NOAA surveys! Thus, any future model will be unable to identify any differences in protocol efficiency without strict assumptions. This is because any model intercepts will be confounded with the latent spatial field. More on this later!
In addition, the surveys were conducted across 4 separate years. Let’s plot the survey tracklines by year. Do you see any cause for concern? Note that the names of the variables in the Effort_Survey are: DATASET and YEAR. Try this on your own! If you get stuck, click ‘Show Code’. Hint: The YEAR variable is of type character and contains 4 unique values (see below).
table(Effort_survey$YEAR)
##
## 2007 2008 2009 2011
## 101 20 19 33
class(Effort_survey$YEAR)
## [1] "character"
ggplot() +
gg(Domain) +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2007',], colour='purple') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2008',], colour='red') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2009',], colour='blue') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2011',], colour='yellow')
Note that the surveys from years 2007, 2008 and 2009 covered largely different regions! Again, this is problematic if we want to model any changes in the whale distribution over time! The effect of year will be confounded by the spatial field. That being said, the data from 2011 appear to be a good candidate for model comparison as the spatial range overlaps with the other 3 years’ effort. We will holdout this data as our test data and use 2007, 2008, and 2009 as our training data.
xtabs(~ YEAR + DATASET, data=Effort_survey@data)
## DATASET
## YEAR DFO NOAA_1 NOAA_2
## 2007 49 3 49
## 2008 0 20 0
## 2009 0 19 0
## 2011 0 12 21
2011’s data comes exclusively from NOAA.
For modelling, we will transform the data from lat/lon into a new “Canadian” CRS: NAD83 datum with UTM Zone 20N projection. This projection will help to preserve euclidean distance between points. We define the CRS object for this projection with EPSG code 2961:
Can_proj <- CRS("+init=EPSG:2961")
Can_proj <- fm_crs_set_lengthunit(Can_proj, unit='km')
The second line of code specifies that we want to work in units of km instead of the default meters. This can prove vital in applications to avoid numerical overflow.
To do the transformation, we will use the spTransform() function. For example, we transform the whale-watch encounters spatial object Sightings_DRWW_sp as follow:
Sightings_DRWW_sp <- spTransform(Sightings_DRWW_sp, Can_proj)
Sightings_DRWW_sp@proj4string
## CRS arguments:
## +proj=tmerc +lat_0=0 +lon_0=-63 +k=0.9996 +x_0=500000 +y_0=0
## +ellps=GRS80 +units=km +no_defs
Notice the changed output from calling @proj4string.
Please repeat this for all the spatial objects that are points, lines or polygons.
Sightings_survey <- spTransform(Sightings_survey, Can_proj)
Effort_survey <- spTransform(Effort_survey, Can_proj)
WW_ports <- spTransform(WW_ports, Can_proj)
Domain <- spTransform(Domain, Can_proj)
Transforming the ‘raster’-like SpatialPixelsDataFrame objects (Slope, Dist_Brier, and Dist_Quoddy) using spTransform would be inappropriate here. The projection leads to a curvature of the pixels. A more appropriate approach here is to use bilinear interpolation. The projectRaster() function from the raster package works great for this. This requires converting the SpatialPixelsDataFrame object into an object of type raster. This is made easy with the function raster(). Finally, to convert the raster object back into a SpatialPixelsDataFrame, we can use the as() function from the maptools package. This function is extremely useful for converting spatial objects between the popular packages: sp, spatstat, raster, and sf. We use this function substantially throughout the workshop.
Slope <- as(projectRaster(raster(Slope), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Brier <- as(projectRaster(raster(Dist_Brier), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Quoddy <- as(projectRaster(raster(Dist_Quoddy), crs=Can_proj), 'SpatialPixelsDataFrame')
Plot the (transformed) Bathymetry and Distance from Port spatial objects. We are going to combine these into a single plot using the multiplot() function from the inlabru package. This function takes as input ggplot objects and an argument layout, specifying how the plots should be arranged (see Additional tips for ways to change the layout).
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slope') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)')+
coord_fixed(ratio = 1),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))
Don’t like the colour scheme? See Additional tips to learn how to define your own manually.
The Domain in its current form has a very complex coastline. In the second and third workshops, inlabru will require a triangulation mesh to be created that covers the Domain. Unfortunately, creating a triangulation mesh over a ‘wiggly’ coastline is very challenging and numerous issues can arise. For example, accurately capturing the ‘wigglyness’ of the coastline with a mesh requires it to have a large number of triangles, slowing down the computation. Conversely, failing to accurately capture the shape of the coastline can lead to some of the encounters to fall outside the mesh, forcing us to discard them!
To combat these issues, we will smooth the Domain using the functions gSimplify() and gBuffer(). gSimplify() attempts to reduce the number of segments used to define the SpatialPolygonsDataFrame. The amount of reduction is determined by the argument tol=. This should help to reduce the number of triangles required to recreate the coastline. Next, gBuffer() will extend (or buffer) the boundary of the SpatialPolygonsDataFrame by an amount determined by the argument width=. This should help to stop some of the encounters from falling outside of the mesh.
However, we should repeat the process once more for the covariates. inlabru requires that every covariate is defined at every point within the computational mesh. When the computational mesh is created later, the boundary may expand slightly. If the covariates are defined on the original domain, there may end up being points of the mesh that do not have a well-defined covariate value. To counter this, it can help to define a second extended and simplified domain for defining the covariates over. By ensuring that this second extended domain is slightly larger than the first extended domain, we can ensure that a covariate value is properly defined at every value in the computational mesh. We create these two domains below:
Domain_restricted <- gSimplify(gBuffer(Domain,width=15),tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain, color='red') +
gg.spatiallines_mod(Effort_survey, color='blue')
# Does the new domain contain all the survey tracklines?
gContains(Domain_restricted,Effort_survey)
## [1] TRUE
# Create a new domain that is slightly larger - for defining covariates later
Domain_restricted2 <- gSimplify(gBuffer(Domain,width=35),tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain_restricted2, color='green') +
gg(Domain, color='red')
# Does it contain the old
gContains(Domain_restricted2,Domain_restricted)
## [1] TRUE
Great! We have defined two new buffered and smoothed domains. The original domain (Domain) is shown in red and is where the computation will take place. The domain over which the computational mesh will be defined (Domain_restricted) is shown in black. Lastly, the domain over which the covariates will be defined (Domain_restricted2) is shown in green.
Once again, the smoothing of the coastline helps us to define a computational mesh that has fewer numerical issues. All computations will still take place over the original domain (Domain). More on that later!
Now we must extend our covariates to take ‘sensible’ values at all points in Domain_restricted2. These imputed values will not affect the model estimates as the model will be fit to the original Domain.
Our covariate Slope is well-defined on the original (un-buffered) domain as defined by the SpatialPolygons Domain. At values outside this, we choose to ‘fill-in’ these points with nearest-neighbour imputations. Next, we redefine the covariate as a log Slope variable, which we call log_Slope. We take the logarithm to reduce the range of values the covariate can take as these could cause our models to make wild predictions in the future workshops. We discuss the reasons for taking a log transform in more depth later in this session.
## define log_slope covariate
log_Slope <- Slope
# Define all values on land equal to 0
#log_Depth$FIWH_MAR_Slope[log_Depth$FIWH_MAR_Slope >= 0] <- 0
log_Slope$FIWH_MAR_Slope <- log(log_Slope$FIWH_MAR_Slope)#log(1-log_Depth$FIWH_MAR_Slope)
names(log_Slope) <- 'log_Slope'
# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
# 2) Extract values of the covariate at the new pixel locations
pixels_Domain$log_Slope <- over(pixels_Domain,log_Slope)$log_Slope
# 3) impute missing values with the nearest neighbour value
pixels_Domain$log_Slope[is.na(pixels_Domain$log_Slope)] <-
log_Slope$log_Slope[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Slope)),]),
'ppp'),as(SpatialPoints(log_Slope@coords),'ppp'), what = 'which')]
log_Slope <- pixels_Domain
# 4) Create a squared log depth covariate (for later)
log_Slope_sq <- log_Slope
names(log_Slope_sq) <- 'log_Slope_sq'
log_Slope_sq$log_Slope_sq <- log_Slope_sq$log_Slope_sq^2
# Plot the covariates with Domain_restricted overlayed
ggplot() + gg(log_Slope) + gg(Domain_restricted)
We also need to buffer and rescale the distance from port covariates Dist_Brier and Dist_Quoddy. Unlike with previous covariates, however, we will fix all buffered values on land equal to a large constant. This will help to ensure that negligible effort is recorded from the whale watch vessels on land through \(\lambda_{eff}\).
We show how to do it with the distance to Brier.
# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
# Extract the average value of distance at the newly created pixel locations
pixels_Domain$Dist_Brier <- over(pixels_Domain,Dist_Brier)$Dist_Brier
# There are some missing values due to newly created pixels being on land
# Fill in missing values with a very large value (1000km).
# This will make the effect of these regions negligible on inference as lambda_eff will be small
pixels_Domain$Dist_Brier[is.na(pixels_Domain$Dist_Brier)] <- 1e3
Dist_Brier <- pixels_Domain
# There is an infinity value at the port. Change to 0
Dist_Brier$Dist_Brier[is.infinite(Dist_Brier$Dist_Brier)] <- 0
max(Dist_Brier$Dist_Brier)
## [1] 1000
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Brier$Dist_Brier <- Dist_Brier$Dist_Brier / 980.7996
ggplot() + gg(Dist_Brier) + gg(Domain)
Now, please rescale and buffer the distance to Quoddy object.
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted2, n=100000),proj4string = Domain@proj4string),'SpatialPixels')[Domain_restricted2,]
pixels_Domain$Dist_Quoddy <- over(pixels_Domain,Dist_Quoddy)$Dist_Quoddy
pixels_Domain$Dist_Quoddy[is.na(pixels_Domain$Dist_Quoddy)] <- 1e3
Dist_Quoddy <- pixels_Domain
# There is an infinity value at the port. Change to 0
Dist_Quoddy$Dist_Quoddy[is.infinite(Dist_Quoddy$Dist_Quoddy)] <- 0
# Let's scale the Dist covariates closer to the (0,1) scale
Dist_Quoddy$Dist_Quoddy <- Dist_Quoddy$Dist_Quoddy / 980.7996
ggplot() + gg(Dist_Quoddy) + gg(Domain)
In the first lecture, we defined the species’ intensity surface \(\lambda_{true}(s)\). We interpret \(\lambda_{true}(s)\) as an ‘expected encounter rate per unit effort around s’. Consequently, \(\lambda_{true}(s)\) is proportional to the true species space use as it is proportional to the species’ utilization distribution \(\pi(s)\). Loosely speaking, \(\lambda_{true}(s)\) helps to answer the question ‘where are the species?’ and is the target quantity we wish to estimate!
However, we believe that the effort from our observers is heterogeneous across our domain \(\Omega\). This heterogeneous effort makes the estimation of \(\lambda_{true}(s)\) a challenge, since both the effort and the true species’ space use both drive the observed encounter locations. To control for effort, we attempt to estimate/approximate an effort intensity surface \(\lambda_{eff}(s)\). This defines ‘the expected amount of observer effort around s’. Loosely speaking, \(\lambda_{eff}(s)\) answers the question ‘where do the observers look?’. Both of these surfaces will be linked to a set of covariates (and later we’ll even include random fields).
For well-designed surveys, e.g. distance-sampling surveys, we have additional data available on the perpendicular distance of the encountered animal to the observer. We saw that we can use this to jointly model a detection probability function \(p(d)\). This models the ‘probability that an individual of the species is detected by an observer, given it is a distance \(d\) away’. inlabru, the package we will use in the later workshops, handles all of the necessary integrations required to jointly model the encounter locations with the encounter distances. Note that the integrations are performed across each survey trackline segment.
Finally, we defined the observed intensity of the encounters \(\lambda_{obs}(s)\) as ‘the expected density of encounters per unit area around point s’. This quantity is not the target of our inference as it is confounded by effort. Loosely speaking, \(\lambda_{obs}(s)\) helps to answer the question ‘where do we expect encounters between the species and observers to occur?’. If our approximation of the effort (\(\lambda_{eff}(s)\)), and the fitted detection probability function \(p(d)\) is good, then we hope to see:
\[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s)\]
or, for a single survey trackline segment: \[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s) p(d(s))\].
The above product decomposition reads loosely as “the expected number encounters around a point s equals the amount of time the observers spent looking around point s multiplied by the expected rate the species visits location s”. Note that \(d(s)\) above defines the perpendicular distance of the point \(s \in \Omega\) from a chosen trackline segment. Values of \(d(s)\) will be different for each trackline segment.
Later, we will be fitting spatial point process models to the encounters. A requirement for using these models is that, conditional on \(\lambda_{eff}(s)\), each encounter with the animal is an independent snapshot from its true intensity \(\lambda_{true}(\textbf{s})\).
Planned surveys can be designed to (approximately) satisfy the independence assumption. The boat speed can be fixed at a high value relative to the individuals’ speeds (no double-counting) and a narrow perpendicular field-of-view can be enforced. Then, assuming that the individuals of the target species are in equilibrium with respect to their stationary distribution \(\lambda_{true}(\textbf{s})\) (i.e. the species’ density) and locally mixing faster than the time between revisits by observers, the independence assumption may reasonably hold.
Conversely, no such assumptions can be made regarding the whale-watch encounters. The unknown positions of the whale-watching boats when they are not with whales, the 360 degree fields-of-view on-board, the variable boat speed, and the repeated encounters of the same individuals throughout each day can all invalidate the independence assumption required to fit the desired models.
In our workshop, we will see some potential solutions to address this issue of independence. In this session, we will perform some exploratory analyses to investigate the suitability of the independence assumption, and to investigate possible ways of estimating observer effort.
Let’s look at the whale-watching data. Specifically, let’s investigate the 2nd, 3rd and 4th encounters in the dataset. Notice that they were made on the same day and very close together in time. Often, little information is provided on the database management procedures. For example, are these multiple encounters with the same individual? Would the database discard or retain problematic repeated encounters?
Sightings_DRWW_sp@data[2:4,c('WS_TIME','WS_DATE','PLATFORM')]
## # A tibble: 3 x 3
## WS_TIME WS_DATE PLATFORM
## <Period> <dttm> <chr>
## 1 16H 5M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 2 16H 23M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 3 16H 32M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
Without additional information available on the database management protocols, we can investigate whether these encounters could be of the same individual. To investigate this hypothesis, we can compute the distance between the encounters using the gDistance() function, the time between the encounters, and the required travel speed of the individual under the assumed hypothesis.
# How far away are these encounters in space (in kilometers)?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)
## 1 2 3
## 1 0.000000 1.316181 3.637082
## 2 1.316181 0.000000 2.386592
## 3 3.637082 2.386592 0.000000
# How far away are these encounters in time (in minutes)?
Sightings_DRWW_sp[3:4,]$WS_TIME-Sightings_DRWW_sp[2:3,]$WS_TIME
## [1] "18M 0S" "9M 0S"
# Could this be the same animal? How fast is this implied movement?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)[cbind(c(1,2),c(2:3))] /
(c(18, 9)/60)
## [1] 4.38727 15.91061
This hypothesis would imply a traveling speed of between 4km/h and 16km/h. Is this plausible? Since fin whales are known to sustain much higher speeds (> 30km/h), we will discard all repeated encounters each day. We will keep only the first sighting made each day by each company.
Note that there is no overlap between the sighting locations from the two whale-watch companies, indicating that an assumption of independent encounters between the two companies could be reasonable. In fact, the distance between the average locations of the encounters made by the two companies is around 80km (see the code below). For an 8-hour whale watch day, this would require a 10km/h persistant movement between the two regions for multiple encounters to occur.
# What is the average distance between the encounters from the two companies?
gDistance(gCentroid(Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='BRIER ISLAND WHALEWATCH',]),gCentroid(Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='QUODDY LINK',]))
## [1] 78.27287
Sightings_DRWW_sp$PLATFORM <- as.factor(Sightings_DRWW_sp$PLATFORM)
ggplot() + gg(Domain) + gg(Sightings_DRWW_sp, colour=Sightings_DRWW_sp$PLATFORM_CODE)
Thus we subset the data to keep only the initial encounters from each company:
Sightings_Brier_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='BRIER ISLAND WHALEWATCH',]
Sightings_Brier_nodup <- Sightings_Brier_nodup[!duplicated(Sightings_Brier_nodup$WS_DATE),]
Sightings_Quoddy_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='QUODDY LINK',]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[!duplicated(Sightings_Quoddy_nodup$WS_DATE),]
dim(Sightings_DRWW_sp)[1]; dim(Sightings_Brier_nodup)[1]; dim(Sightings_Quoddy_nodup)[1];
## [1] 567
## [1] 85
## [1] 144
Notice that we have a total of 229 whale watch encounters after the subsetting procedure, down from an original count of 567 encounters.
Strictly speaking, for the desired point process models to be reasonable, it is required that the initial daily encounter locations from the whale watch vessels are independent realizations from \(\lambda_{true}\), conditioned on \(\lambda_{eff}\). We will assume that the typical journeys/routes of the whale watch vessels remains constant across months and years under study. Thus, we assume that \(\lambda_{eff}\) is constant through time.
As we discussed earlier, we are unable to model temporal changes in the species density due to the spatially non-overlapping survey efforts across the months and years. Thus any changes in the species intensity from June - August, or across the years 2007-2009 will be missed. Thus, we are forced to assume that the summer species density remains constant between 2007-2009. It is the summer species density that is our target for inference.
How reasonable is our assumption that the species density remains constant across the years and across the months? Again, we can visually assess the suitability of the assumption with a plot.
If each whale-watch sighting is indeed an independent realization from a temporally static point process, then there should be no association between the spatial distances between the whale-watch encounters and how far apart in time they were made.
To plot distances against time intervals, we first compute the distances in time and the euclidean distances in space between each sighting.
difftime_Brier<- dist(as.numeric(ymd(Sightings_Brier_nodup$WS_DATE)))
diffspace_Brier <- dist(Sightings_Brier_nodup@coords)
difftime_Brier_fac <- as.factor(difftime_Brier)
difftime_Quoddy <- dist(as.numeric(ymd(Sightings_Quoddy_nodup$WS_DATE)))
diffspace_Quoddy <- dist(Sightings_Quoddy_nodup@coords)
difftime_Quoddy_fac <- as.factor(difftime_Quoddy)
We will create multiple plots with days on the x-axis and km on the y-axis. The first plots are restricted to only consider temporal differences of less than 85 days. We are plotting a sequence of boxplots to help detect patterns.
ggplot(data=data.frame(difftime =
difftime_Brier_fac[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
The second plot is a different version that uses a loess smoother curve to try to help uncover the trend.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess') +
xlab('difference in time (days)') +
ylab('difference in space (km)')
These two plots indicate some moderate autocorrelation between the encounters, with encounters made ~1 month apart typically being around 50% further away than those made 24 hours apart. Notice that the trend plateaus after ~14 days. Perhaps this trend is due to the autocorrelation being caused exclusively by persistence in animal movement? Furthermore, the flat-line trend between day ~30 and day ~75 could be evidence in favour of an approximately constant density across the summer months. If the space use did indeed change dramatically across the months (e.g. due to migration), then we may expect to see a monotonically increasing trend over time.
The third plot investigates the space-time relationship across years.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[],
diffspace = as.numeric(diffspace_Brier)[]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')+
xlab('difference in time (days)') +
ylab('difference in space (km)')
This third plot shows no evidence that the average spatial separation between whale watch encounters differs year-to-year. This supports our assumption that the spatial density can be assumed to be constant across the years, albeit within the (very) small spatial region visited by the whale-watch boats departing from Brier Island.
We now repeat the plots, but for the whalewatch boats that depart from Quoddy, starting with the loess smoother curve on the first 84 days.
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
In this plot, we see a large separation in the y-axis. This is caused by a single encounter that was made (far) East of the port. This is seen in the map created above, as the right-most cluster of blue points. Note that both encounters seen in the cluster were made on the same day (we only kept the initial encounter). The plot above indicates that these encounters were made over 90km from port! From hereon out, we will discard this encounter and assume the location was miscoded.
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[which(gDistance(Sightings_Quoddy_nodup, gCentroid(Sightings_Quoddy_nodup),byid=T)<50),]
ggplot(data=data.frame(difftime =
difftime_Quoddy_fac[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
These two plots also show evidence of potential autocorrelation existing upto ~14 days. This could be due to persistence of animal movement, for example, due to a few whales remaining in the same area for a few days.
As we see below, there is no evidence for across year differences.
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(diffspace_Quoddy)<50]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')+
xlab('difference in time (days)') +
ylab('difference in space (km)')
We have seen some evidence that residual autocorrelations remain in the encounters data. This is despite subsetting the data to the initial daily encounters. The result of this may be overdispersion. We may be able to partially correct for this by using of a log-Gaussian Cox process (introduced in lecture 2). By including latent effects (e.g. spatial fields) in the model, we can hopefully capture some of this additional clustering. In fact, log-Gaussian Cox processes are always overdispersed relative to Poisson processes.
However, trying to determine which latent effects are really describing \(\lambda_{true}\) and which are simply capturing the movement persistence is problematic. For example, if we falsely attribute too much variability and clustering to movement persistence, then we risk underestimating the uncertainty with our estimates of \(\lambda_{true}\)! This could lead to over-confident inference.
It is common for multiple competing models to be fit. Deciphering which model fits the data ‘best’ using information criterion alone (AIC, DIC, etc.,) can be problematic. Performing model comparisons using the same dataset as were used to fit the models and perform exploratory analysis can be problematic (overfitting can occur). An alternative approach is to test both the predictive accuracy and the estimated uncertainty on an unseen dataset.
We now subset the data into a training and test set. We choose the training set to contain all the encounters from 2007-2009. For the test data, we choose all the encounters from 2011. See Additional tips for an explanation of why the potential repeated sighting in 2011 between survey and whale-watch data means that the whale-watching data cannot be added to the training dataset. Note that we perform all remaining exploratory analysis on the training data from hereon out.
Below, we split the data into training and test data.
Sightings_Brier_nodup_test <- Sightings_Brier_nodup[which(Sightings_Brier_nodup$YEAR==2011),]
Sightings_Brier_nodup <- Sightings_Brier_nodup[which(Sightings_Brier_nodup$YEAR!=2011),]
Sightings_Quoddy_nodup_test <- Sightings_Quoddy_nodup[which(Sightings_Quoddy_nodup$YEAR==2011),]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[which(Sightings_Quoddy_nodup$YEAR!=2011),]
Sightings_survey_test <- Sightings_survey[which(Sightings_survey$YEAR==2011),]
Sightings_survey <- Sightings_survey[which(Sightings_survey$YEAR!=2011),]
Effort_survey_test <- Effort_survey[which(Effort_survey$YEAR=='2011'),]
Effort_survey <- Effort_survey[which(Effort_survey$YEAR!='2011'),]
# How many survey sightings by year?
table(Sightings_survey$YEAR)
##
## 2007 2008 2009
## 45 17 1
Next, we evaluate a crude estimate of the total amount of survey effort spent each year, as measured by the total trackline length. Using this, we then crudely estimate a ‘catch per unit effort’ (CPUE) measure by year.
# How much survey effort (in trackline length) is there by year
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum)
## Effort_survey$YEAR: 2007
## [1] 11643.15
## ------------------------------------------------------------
## Effort_survey$YEAR: 2008
## [1] 571.9083
## ------------------------------------------------------------
## Effort_survey$YEAR: 2009
## [1] 819.9337
# Crude estimate of relative `CPUE'
(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))/
min(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))
##
## 2007 2008 2009
## 3.168989 24.372565 1.000000
# How many WW sightings by year
table(Sightings_Quoddy_nodup$YEAR)
##
## 2007 2008 2009
## 39 24 44
table(Sightings_Brier_nodup$YEAR)
##
## 2007 2008 2009
## 24 19 19
The ‘CPUE’ values are highly variable! The wild differences could be due to the small efforts in years 2008 and 2009 relative to 2007. Alternatively, they could be due to differences in survey protocols that took place each year. Or, the differences could actually reflect real differences in whale density across the three distinct regions visited each year! Without taking a model-based approach and controlling for differences in spatial effort, it is unclear how to combine the encounters from these three surveys.
As we will see in the next tutorial, by taking a model-based approach to inference, we will be able to combine the data from all three surveys to estimate the whale density across space. Unfortunately, due to the lack of spatial overlap between the surveys’ efforts (tracklines), we are unable to account for any differences between the survey protocols within a model without strong prior information available. For example, an intercept added to the model for each survey would be confounded by the spatial field. From now on, we assume that each survey is equivalent.
To explore which detection functions (\(p(d)\)) could be used with the survey data, we will first plot the histogram of measured distances from the plane (i.e. the observed distances of encounters from the trackline).
# What is the maximum detection distance?
ggplot(Sightings_survey@data, aes(x=DISTANCE)) + geom_histogram() + xlab('Distance (m)')
The histogram of the perpendicular distances from the aircraft shows an apparent maximum detection probability at a value of distance above 0 metres! Low detection probability close to 0 metres is common for aerial surveys since it is harder to see below the plane. Common approaches include truncating the data, and fitting a two-parameter distance sampling function that allows for a non-monotonic relationship to exist.
Given the limited size of the survey data and the scope of this workshop, we choose not to pursue either option and we instead continue with estimating a single-parameter detection probability function. This crude approach assumes the detection probability is a monotonically decreasing function of distance.
In addition, it is often advised in distance sampling applications to threshold our upper detection distances at a ‘sensible’ value. For the histogram above, it looks like 2km could be a reasonable threshold value to choose. Let’s set all values above 2km equal to 2km and scale the distances onto the km units scale. This rescaling to units of km is crucial for numerical stability. We define a variable ‘distance’ that contains the rescaled distances. Note that a few distances are missing. We impute these with the mean distance.
# Setting the distances <250 to 250
# Sightings_survey@data$DISTANCE <- ifelse(Sightings_survey@data$DISTANCE>250,Sightings_survey@data$DISTANCE,250)
# Setting the distances >2000 to 2000
Sightings_survey@data$DISTANCE <-
ifelse(Sightings_survey$DISTANCE>2000 & !is.na(Sightings_survey$DISTANCE),
2000,Sightings_survey$DISTANCE)
# Needs renaming to match the formula argument
Sightings_survey$distance <- Sightings_survey$DISTANCE
# Impute missing distances with the mean
Sightings_survey$distance[is.na(Sightings_survey$distance)] <- mean(Sightings_survey$distance,na.rm=T)
# Remove the old DISTANCE column
Sightings_survey <- Sightings_survey[,-c(8)]
# Rescaling to km
Sightings_survey$distance <- Sightings_survey$distance / 1000
# Plot the modified distances
ggplot(Sightings_survey@data, aes(x=distance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The single-parameter detection probability function we will use is the half-normal detection function:
\[p(d) = exp\left(\frac{-d^2}{2\sigma^2}\right)\]
In lecture 1 we showed how we could attempt to model effort with a set of informative covariates. This approach is most useful for opportunistic datasets which lack quality information on the observers’ locations. Throughout our tutorials, we will use distance from port as a covariate of effort for the the whale-watch data.
Here, we investigate a potential functional form for the distance from port covariate. To do this, we plot a histogram of the distances from port at each of the whale watch sighting locations.
# Plot the encounters with distance from the port
hist(over(Sightings_Brier_nodup,Dist_Brier)$Dist_Brier, main='Histogram of the distance from port of the Brier encounters', xlab = 'Distance from port')
hist(over(Sightings_Quoddy_nodup,Dist_Quoddy)$Dist_Quoddy,breaks=20, main='Histogram of the distance from port of the Quoddy encounters', xlab = 'Distance from port')
For both ports, we detect a decreasing frequency of encounters made as the distance from port increases. Based on the histograms, we choose to model the functional form as a half-normal function. More complicated functional forms (e.g. weibull or hazard functions) could also be used.
More specifically, let the whale-watch effort intensity from the \(i^{th}\) port be denoted \(\lambda_{eff, i}\) and let the location of the \(i^{th}\) port be denoted \(s_i^*\). Finally, let the scalar parameter \(\lambda_i\) be unique for each port. Then we assume the following functional form for the whale watch effort intensity from the \(i^{th}\) port: \[\lambda_{eff,i}(s) = \lambda_i exp\left(\frac{-||s-s_i^*||^2}{2\sigma_i^2} \right)\].
On the log scale, this covariate can be estimated as a linear effect on the squared distance from port. Thus, we create SpatialPixelsDataFrame objects which store the squared distances from port.
Dist_Brier_sq <- as(Dist_Brier,'SpatialPixels')
Dist_Brier_sq$Dist_Brier_sq <- Dist_Brier$Dist_Brier^2
Dist_Quoddy_sq <- as(Dist_Quoddy,'SpatialPixels')
Dist_Quoddy_sq$Dist_Quoddy_sq <- Dist_Quoddy$Dist_Quoddy^2
In this section, we will investigate trends between the observed intensity and the covariate bathymetry using only the training data. We will also see why taking a log transform of Slope was a sensible thing to do.
A quick first step is to evaluate whether the empirical densities of a covariate at the encounter locations (representing habitat used by the whales) differs from the empirical density across the domain (representing available habitat values). Such difference may indicate that the whales are associated with a subset of the available habitat. Of course, any differences may simply reflect differences in the types of habitat visited by observers!
To extract the covariate values at each location, we will use the function over() from the sp package.
# Create a SpatialPointsDataFrame containing all the encounter locations
All_obs <- rbind(Sightings_Brier_nodup[,'YEAR'],
Sightings_Quoddy_nodup[,'YEAR'],
Sightings_survey[,'YEAR'])
# Extract the value of slope at the observation locations
Slope_obs <- over(All_obs, Slope)
First, we plot the empirical densities of Slope at the encounters locations and throughout the domain (study area).
ggplot(data=data.frame(Slope_obs = Slope_obs$FIWH_MAR_Slope),
aes(x=Slope_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Slope_dom = as.numeric(Slope$FIWH_MAR_Slope)),
aes(x=Slope_dom, colour='Domain')) + xlab('Slope') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
We see a very skewed distribution of Slope with a huge range of values. Leaving the covariate in this form risks wild predictions of intensity being made! A log transform may reduce the skew of the distribution and help to stabilise predictions. Let’s see this:
# Extract the value of slope at the observation locations
log_Slope_obs <- over(All_obs, log_Slope)
ggplot(data=data.frame(Slope_obs = log_Slope_obs$log_Slope),
aes(x=Slope_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Slope_dom = as.numeric(log_Slope$log_Slope)),
aes(x=Slope_dom, colour="Domain")) +
xlab('log Slope') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
These plots suggest that encounter locations are associated with regions with higher slope compared with what is available in the study area. However, these results are only investigating the associations between \(\lambda_{obs}\) and the covariates and not the desired associations between \(\lambda_{true}\) and the covariates. The associations may simply reflect the types of regions preferentially visited by observers! Note that the last plot shows a lack of overlap in the highest log_Slope values (around 2). Consequently, detected associations between log_Slope and animal density may become extrapolated beyond the range of observed values! We will address this in the later sessions.
inlabruIn lecture 1, we introduced the simplest distributional assumption on the number of encounters in a region. We assumed that the distribution of the number of encounters within any region \(A\) was Poisson distributed, with mean equal to the integrated intensity surface:
\[N(Y \cap A) \backsim Poisson(\int \lambda_{obs}(s) ds)\].
Remember that \(N(Y \cap A)\) represents the number of encounters in \(A\). We will see in lecture 2 that this equation actually defines an inhomogeneous Poisson process, where we assume that the rate of encounters with whales changes through space according (only) to a set of measured covariates (i.e., the rate is spatially-inhomogeneous). The above distribution for \(N(Y \cap A)\) will hold for any number of non-overlapping regions \(A_i \subset \Omega\). Furthermore, we will see that if we define a grid of non-overlapping cells \(A_i\) that cover (partition) the domain \(\Omega\), then as the number of cells of the grid increases, the (Poisson) likelihood of the counts (falling within each \(A_i\)) will converge to the likelihood of the inhomogeneous Poisson process that preserves the exact locations of the encounters! Do not panic if this point is unclear - we will cover it in depth in lecture 2.
Importantly, the inlabru package can fit inhomogeneous Poisson processes (and log-Gaussian Cox processes) to the data using a similar approximation technique to the one defined above! Thus, we can use inlabruto investigate trends between the observed intensity and the log slope covariate using the encounter data. However, inlabru can go even further! It is able to approximate the point process likelihoods when the observers have followed distance sampling protocols along a set of known tracklines and is also able to model the whale watch effort using a set of covariates! Thus, we will also use inlabruto investigate trends between the true whale intensity and the log slope covariate.
In the next 2 lectures and sessions, we will explore inlabru in depth. We will explore the package’s ability to fit both inhomogeneous Poisson processes and log-Gaussian Cox processes and we will learn the syntax in depth. For now, we will use only the output from inlabru models for the purposes of exploratory analyses. We will save the explanation of the code until the next lectures and sessions.
Ignoring effort for the time being, we will use inlabru to fit an inhomogeneous Poisson point process to investigate the relationship between \(\lambda_{obs}(s)\) and log Slope. To do this, we simply ignore effort completely and fit the observed encounter locations to a set of covariates of interest. We do this below (in the hidden code), with log_Slope as our covariate of interest. The output from the model is shown below:
# These inlabru options will be explained in depth later
bru_options_set(
bru_verbose = FALSE,
verbose = FALSE,
control.inla=list(int.strategy='eb'),
num.threads=2
)
# Define components (explained later)
cmp_IPP1 <- ~
log_Slope(main = log_Slope, model='linear') +
Intercept(1)
# Define formulae for 3 data sources (explained later)
formula_IPP1_survey <- coordinates ~ Intercept + log_Slope
formula_IPP1_Quoddy <- coordinates ~ Intercept + log_Slope
formula_IPP1_Brier <- coordinates ~ Intercept + log_Slope
# Load in a pre-compiled mesh (explained later)
mesh_land <- readRDS('./Data/mesh_land_exp.rds')
# Define likelihoods for 3 data sources (explained later)
like_IPP1_survey <- like(data=Sightings_survey,
samplers=Domain,
domain = list(
coordinates = mesh_land
),
formula = formula_IPP1_survey,
family = 'cp')
like_IPP1_Quoddy <- like(data=Sightings_Quoddy_nodup,
samplers=Domain,
domain = list(
coordinates = mesh_land
),
formula = formula_IPP1_Quoddy,
family = 'cp')
like_IPP1_Brier <- like(data=Sightings_Brier_nodup,
samplers=Domain,
domain = list(
coordinates = mesh_land
),
formula = formula_IPP1_Brier,
family = 'cp')
# fit the model (explained later)
mod_IPP1 <- bru(like_IPP1_survey,
like_IPP1_Quoddy,
like_IPP1_Brier,
components=cmp_IPP1)
# We will focus on the output for now!
summary(mod_IPP1)
## inlabru version: 2.2.5
## INLA version: 21.01.08-1
## Components:
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor: coordinates ~ Intercept + log_Slope
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor: coordinates ~ Intercept + log_Slope
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor: coordinates ~ Intercept + log_Slope
## Time used:
## Pre = 5.75, Running = 0.49, Post = 0.517, Total = 6.75
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## log_Slope 0.20 0.048 0.104 0.200 0.295 0.200 0
## Intercept -8.17 0.066 -8.302 -8.169 -8.043 -8.168 0
##
## Expected number of effective parameters(stdev): 2.00(0.00)
## Number of equivalent replicates : 795.47
##
## Deviance Information Criterion (DIC) ...............: 4265.52
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 2.00
##
## Watanabe-Akaike information criterion (WAIC) ...: 4265.55
## Effective number of parameters .................: 2.02
##
## Marginal log-Likelihood: -2143.33
## Posterior marginals for the linear predictor and
## the fitted values are computed
We ignore all the details regarding the code for now and look at the table of fixed effects. We find a ‘significant’ positive association between \(\lambda_{obs}\) and log slope and thus some evidence that the whales may be found more often in regions with higher slope. The ‘significance’ comes from the fact that the 95% posterior credible intervals do not cover 0 and the credible intervals are valid when the assumed Poisson distribution on the counts is true. However, the cause of this ‘significant’ association may be due to either the observers or the whales spending more time in regions of high slope!
To investigate whether this association with log Depth is in fact with the true whale intensity \(\lambda_{true}\), we can attempt to adjust for effort and see if these estimates change. We once again fit an Inhomogeneous Poisson point process model, but this time we include the survey line transects (Effort_survey) and the two squared distance from port covariates (Dist_Brier_sq,Dist_Quoddy_sq). We also define a half-normal detection probability function for the survey data.
More specifically, we are specifying a model for both the log effort intensity and the log whale intensity. For the log effort intensity we are estimating a unique intercept parameter for the two whale watch observers and we are assuming that all encounters with whales at a distance of 0 metres from the survey aircraft are detected. We are assuming that the effort from the two whale watch companies is explained by a half-normal function of distance that we estimate. Similarly, we assume that the probability of detecting a whale decreases as a half-normal function of distance from the trackline (\(p(d)\)) for the observers onboard the aircrafts. Thus, our model is:
\[\lambda_{obs}(s) = \lambda_{eff}(s) \lambda_{true}(s)\\ \lambda_{eff,i}(s) = \lambda_i exp\left(\frac{-||s-s_i^*||^2}{2\sigma_i^2} \right) \textrm{ for WW data} \\ \lambda_{eff,surv,j}(s) = p(d_j(s)) \textrm{ for survey data}\\ \lambda_{true}(s) = \beta_0 + \beta_1 X(s).\]
For the two whale watch companies, \(\lambda_{eff,i}(s)\), \(s_i^*\) denotes the location of the whale watch port, \(\lambda_i\) denotes the intercept for the whale watch company, and \(\sigma_i\) determines how far from port the whale watch vessels travel. For the survey data, we compute the effort uniquely for each survey trackline (indexed by \(j\)) with \(p(d_j(s))\) denoting the perpendicular distance from the \(j^{th}\) trackline segment (if it exists). We will explain this in more depth in the upcoming lectures. Finally, we assume the true species intensity is driven exclusively by log depth (\(X(s)\)).
Again, the details of the code are not important to understand right now. We will cover the inlabru syntax in depth in the following tutorials and lectures. What we will focus on is whether or not we detect an association between log slope after making this adjustment for effort.
# Define the half-normal detection probability function
log_hn = function(distance, lsig){
-0.5*(distance/exp(lsig))^2
}
# Define the components (ignore for now)
cmp_IPP2 <- ~
log_Slope(main = log_Slope, model='linear') +
Intercept(1) +
Intercept_Quoddy(1) +
Intercept_Brier(1) +
Dist_Brier_sq(main = Dist_Brier_sq, model='linear') +
Dist_Quoddy_sq(main = Dist_Quoddy_sq, model='linear') +
lsig(1)
# Define the formulae (ignore for now)
formula_IPP2_survey <- coordinates + distance ~ Intercept + log_Slope + log_hn(distance, lsig)
formula_IPP2_Quoddy <- coordinates ~ Intercept + Intercept_Quoddy + log_Slope + Dist_Quoddy_sq
formula_IPP2_Brier <- coordinates ~ Intercept + Intercept_Brier + log_Slope + Dist_Brier_sq
# Define the likelihood objects (ignore for now)
like_IPP2_survey <- like(data=Sightings_survey,
samplers=Domain,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))
),
formula = formula_IPP2_survey,
family = 'cp')
like_IPP2_Quoddy <- like(data=Sightings_Quoddy_nodup,
samplers=Domain,
domain = list(
coordinates = mesh_land
),
formula = formula_IPP2_Quoddy,
family = 'cp')
like_IPP2_Brier <- like(data=Sightings_Brier_nodup,
samplers=Domain,
domain = list(
coordinates = mesh_land
),
formula = formula_IPP2_Brier,
family = 'cp')
# Fit the model
mod_IPP2 <- bru(like_IPP2_survey,
like_IPP2_Quoddy,
like_IPP2_Brier,
components=cmp_IPP2)
# We will focus on the output!
summary(mod_IPP2)
## inlabru version: 2.2.5
## INLA version: 21.01.08-1
## Components:
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## Intercept_Quoddy: Model types main='linear', group='exchangeable', replicate='iid'
## Intercept_Brier: Model types main='linear', group='exchangeable', replicate='iid'
## Dist_Brier_sq: Model types main='linear', group='exchangeable', replicate='iid'
## Dist_Quoddy_sq: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ Intercept + log_Slope + log_hn(distance,
## lsig)
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor: coordinates ~ Intercept + Intercept_Quoddy + log_Slope + Dist_Quoddy_sq
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor: coordinates ~ Intercept + Intercept_Brier + log_Slope + Dist_Brier_sq
## Time used:
## Pre = 7.21, Running = 8.65, Post = 1.21, Total = 17.1
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## log_Slope 0.245 0.060 0.128 0.245 0.363 0.244 0
## Intercept -8.598 0.168 -8.939 -8.594 -8.281 -8.586 0
## Intercept_Quoddy 6.694 0.222 6.263 6.692 7.133 6.689 0
## Intercept_Brier 3.823 0.226 3.382 3.822 4.269 3.820 0
## Dist_Brier_sq -206.689 22.312 -251.856 -206.209 -164.223 -205.254 0
## Dist_Quoddy_sq -348.704 22.194 -392.865 -348.502 -305.702 -348.101 0
## lsig 0.063 0.152 -0.249 0.069 0.347 0.079 0
##
## Expected number of effective parameters(stdev): 6.01(0.00)
## Number of equivalent replicates : 2450.59
##
## Deviance Information Criterion (DIC) ...............: 2738.81
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 5.98
##
## Watanabe-Akaike information criterion (WAIC) ...: 2763.62
## Effective number of parameters .................: 27.93
##
## Marginal log-Likelihood: -1474.69
## Posterior marginals for the linear predictor and
## the fitted values are computed
After making this adjustment for heterogeneous effort, we see that log slope is still found to be significantly associated with the whale density, with a positive coefficient. The negative values of the two Dist_X_sq effects implies that the estimated amount of whale watch effort decreases quickly as the ‘distance’ from port increases. The lsig parameter determines how quickly the distance sampling detection probability decays to 0 as the distance from the survey trackline increases.
We plot the effort intensity surfaces now. Again, the details of the code can be ignored for now - we will explain these in depth in the future sessions and lectures! First we plot the fitted effort intensity surface \(\lambda_{effort}\) from the two whale watch ports:
pixels_plot <- pixels(mesh_land, mask=Domain)
pred_Brier <- predict(mod_IPP2, pixels_plot, ~exp(Intercept_Brier + Dist_Brier_sq), n.samples=20)
pred_Quoddy <- predict(mod_IPP2, pixels_plot, ~exp(Intercept_Quoddy + Dist_Quoddy_sq), n.samples=20)
multiplot(
ggplot() + gg(pred_Brier) + ggtitle('Brier Effort'),
ggplot() + gg(pred_Quoddy) + ggtitle('Quoddy Effort'),
layout = matrix(c(1,2),nrow=1,ncol=2))
We see that the model predicts the vessels from Brier to ‘spread out’ more than those from Quoddy. Lastly, we plot the estimated detection probability function for the survey data against distance with 95% posterior credible intervals:
dist_df <- data.frame(distance=seq(from=0,to=2,length.out = 30))
samp_IPP2 <- generate(mod_IPP2,n.samples=20)
pred_dist <- sapply(samp_IPP2,FUN = function(x){exp(log_hn(dist_df$distance,x$lsig))})
dist_df$mean <- apply(pred_dist, 1, mean)
dist_df$UCL <- apply(pred_dist, 1, quantile, probs=c(0.975))
dist_df$LCL <- apply(pred_dist, 1, quantile, probs=c(0.025))
ggplot(dist_df, aes(x=distance,y=mean,ymax=UCL,ymin=LCL)) +
geom_line() + geom_ribbon(alpha=0.4) + ylab('Probability of detection') +
xlab('Distance from trackline (km)')
In summary, from this exploratory analysis, we have found evidence that the log slope is associated with the true whale intensity. This association remains, even after attempting to control for effort. Thus, log slope should be considered as a covariate in our future models.
Note that in practice it can be a good idea to play around with different covariates and functional forms (e.g. quadratic effects) at this stage. These inhomogeneous Poisson process models are much quicker to fit than the log-Gaussian Cox process models we will be fitting in the future sessions. Furthermore, these inhomogeneous Poisson process models will give us a good understanding of which covariates should be considered.
Unfortunately, by being in the inhomogeneous Poisson process class of point processes, the above models lack one crucial feature. They are unable to account for any additional spatial correlations/clustering that are frequently caused by unmeasured environmental covariates and biological processes acting behind the scenes. These additional spatial clusterings (correlations) are likely to be present in our example: do we really expect log depth to fully explain the whales’ presence?… Failing to account for these additional spatial correlations may lead to the reported measures of uncertainty in both predictions and in inference to be inaccurate.
This is why we turn to log-Gaussian Cox processes in the next tutorial. By assuming the intensity surface to be a realisation from a spatially-smooth (log-Gaussian) random field, we will be able to adjust our inference to take into account any additional spatial correlations and/or unmeasured covariates! By doing this, both parameter estimates and measures of uncertainty should be better adjusted to account for these autocorrelations. We will describe log-Gaussian Cox processes in depth in the next lecture.
If you got stuck on any of the exercises, then please feel free to try them again. Here are links to the problems:
For publication, there can be issues regarding copyright of Google Maps. Using OpenStreetMap can help. To guarantee this simply add the following argument: source='osm', force=TRUE to gmap(). Double check the console that the maps are indeed being downloaded from stamen or osm. For brevity we have suppressed the messages.
gmap(Sightings_survey, source='osm', force=TRUE) +
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
Note for this to work it needs to be in the original project (lat/lon), not UTM.
The multiplot() function is a very flexible function that enables publication-quality figures to be made with relative ease.
To change the order of the plots you can change the argument byrow=TRUE to byrow=FALSE.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(1:4, nrow=2, ncol=2, byrow = FALSE))
We can also change the size of the figures by changing the matrix in the layout.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
As you can see plots of different size stretches the maps, to keep the set projection we can use coord_fixed(ratio = 1).
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
We can define the colour palette easily.
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"PuBuGn")),
limits = range(...))
}
Look at ?RColorBrewer::brewer.pal to see what other colour palettes are available.
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
colsc(Slope@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
Have a go at creating your own colour palette function. Investigate the effects of changing both arguments to brewer.pal.
colsc2 <- function(...){
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(7,"Spectral")),
limits = range(...))
}
multiplot(ggplot() +
gg(Domain) +
gg(Slope) + xlab('East(km)') + ylab('North(km)') + labs(fill='Slopeetry') +
colsc2(Slope@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
Could we not also put 2011’s whale-watch data into the training set? Below is some code showing that we cannot. A whale is sighted by at least one whale watch company on every day in 2011 that a survey detects a whale. The plot below shows that these encounters could have been of the same animal. Even if they weren’t, the tracklines in 2011 still visited areas that the whale watch vessels were present. Since our training and test datasets are required to be independent of each other, we must therefore also remove the 2011 whale watch encounters from the training data.
# what dates were survey encounters made on in 2011?
unique(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011])
## [1] "2011-06-21" "2011-08-12" "2011-08-13"
# Are encounters by the WW vessels made on these dates?
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][1],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][2],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][3],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
# Could they be of the same animal?
ggplot() + gg(Domain) +
gg(Sightings_survey_test[Sightings_survey_test$YEAR==2011,],colour='green') +
gg.spatiallines_mod(Effort_survey_test[Effort_survey_test$YEAR==2011,],colour='yellow')
The code for hiding the Rmd code chunks came from Martin Schmelzer, found here
References/Sources for data sets:
DFO Maritimes Region Whale Sightings Database
MacDonald, D., Emery, P., Themelis, D., Smedbol, R.K., Harris, L.E., and McCurdy, Q. 2017. Marine mammal and pelagic animal sightings (Whalesightings) database: a users guide. Can. Tech. Rep. Fish. Aquat. Sci. 3244: v + 44 p.
NOAA NARW Surveys
Timothy V.N. Cole. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA
2007 TNASS DFO Aerial Survey
Lawson. J.W., and Gosselin, J.-F. 2009. Distribution and preliminary abundance estimates for cetaceans seen during Canada’s marine megafauna survey - A component of the 2007 TNASS. DFO Can. Sci. Advis. Sec. Res. Doc. 2009/031. vi + 28 p.
NOAA Cetacean Surveys
Timothy V.N. Cole and D. Palka. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA